Description of the project: This project uses the World Bank’s Education Statistics dataset to explore long-term trends and relationships among key indicators of the US education system. Our goal is to understand how structural aspects of education (access, staffing, and outcomes) have evolved over time and how changes in one variable may influence changes in another. We will analyze variables in different categories and see differences between sexes. By focusing on longitudinal, cross-national data, we aim to identify potential patterns and meaningful interactions that could inform educational policy and investment.
The analysis of various indicators will allow us to examine temporal changes and potential external factors that might shape different educational outcomes. We selected three major indicator categories that reflect different dimensions of the education system: Enrollment rate vs Attendance rate We will compare enrollment and attendance rates across different levels of schooling (primary, secondary, tertiary) to examine whether higher enrollment leads to consistent attendance over time. Number of teachers vs Teaching staff compensation We will investigate how changes in staffing levels correspond with compensation levels across educational stages. Educational attainment rate We will track changes in the proportion of the population and see differences between sexes achieving different levels of education to understand broader societal progress and disparities. Motivation When we were brainstorming datasets to go with, we had difficulty choosing a topic because the possibilities were endless. We looked at several options from public health to climate to economics, but we eventually narrowed it down to education, which was something that all of us are connected to and that our classmates and readers could easily relate to. We initially considered focusing on Emory-specific data but broadened our scope to the World Bank’s Education Statistics for its larger context and long-term coverage. As students at an American university, we were especially interested in examining data trends within the United States. Education policy and access here have evolved significantly over time, and exploring long-term patterns in enrollment, attendance, and staffing can reveal whether increased investment in education translates into better participation and equity. This directly relates to us, as college students who have undergone the three main levels of education and were told that our own efforts, most notably in high school, would determine our outcomes later in life. But how much do individual outcomes actually depend on the student, and how much on the school systems and structures that support them?
Structure: We will perform exploratory data analysis (EDA) in R using the tidyverse suite of libraries. Our analysis will include: Cleaning and reshaping selected indicators Visualizing trends across time and education levels Examining relationships among variables through summary statistics and correlation plots Interpreting observed trends in relation to historical or policy-level changes such as shifts in federal funding
Visualizations and stats:
#set up data for use
#load and read stuff
# install.packages("readr")
# install.packages("tidyverse")
library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.1.0
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_csv("EdStats_v01.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 367782 Columns: 58
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country name, Country code, Indicator name
## dbl (54): 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, ...
## lgl (1): 2024
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#clean up data to have USA data
data_clean <- data %>%
filter(grepl("USA", `Country code`)) %>%
# filter(!grepl("students from", `Indicator name`, ignore.case=TRUE)) %>%
# filter(!grepl("learning deprivation severity", `Indicator name`, ignore.case=TRUE)) %>%
select(-`2024`) #all na
#make sure no NA cols remain
colnames(data_clean)[colSums(is.na(data_clean)) == nrow(data_clean)]
## character(0)
write_csv(data_clean, "EdStats_USA.csv")
#filter for enrollment/attendance info
data <- read_csv("EdStats_USA.csv")
## Rows: 2125 Columns: 57
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country name, Country code, Indicator name
## dbl (54): 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_filter_attend <- data %>%
filter(grepl("total net enrolment rate|total net attendance rate", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("female|male", `Indicator name`)) %>%
filter(!grepl("adjusted gender parity index", `Indicator name`, ignore.case = TRUE)) %>%
select(where(~!all(is.na(.)))) %>% view() %>%
write_csv("EdStats_attend.csv")
#filter for num teachers and teaching staff compensation info
data <- read_csv("EdStats_USA.csv")
## Rows: 2125 Columns: 57
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country name, Country code, Indicator name
## dbl (54): 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_filter_teacher <- data %>%
filter(grepl("teachers in|teaching staff compensation", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("female|male", `Indicator name`)) %>%
filter(!grepl("percentage of qualified", `Indicator name`, ignore.case = TRUE)) %>%
select(where(~!all(is.na(.)))) %>% view() %>%
write_csv("EdStats_teacher.csv")
#calculate and visualize
#load stuff
library(tidyverse)
# install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data_attend <- read_csv("EdStats_attend.csv")
## Rows: 6 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country name, Country code, Indicator name
## dbl (22): 1975, 1986, 1987, 1990, 1991, 1993, 1994, 1995, 1996, 1999, 2005, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#select years with most observations
data_attend_years <- data_attend %>%
select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1995`, `1996`, `1999`, `1994`))
#reshape data for visualization
data_long <- data_attend_years %>%
pivot_longer(cols=`2005`:`2022`, names_to="Year", values_to="Value") %>%
mutate(
Year=as.numeric(Year),
Type=case_when(
grepl("attendance", `Indicator name`, ignore.case=TRUE) ~ "Attendance",
grepl("enrolment", `Indicator name`, ignore.case=TRUE) ~ "Enrollment",
TRUE ~ "Other"
),
Level=case_when(
grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Primary School",
grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
TRUE ~ "Other"
)
) %>%
filter(Type!="Other", !is.na(Value))
#comparison plots
p_line_per_school_horizontal <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
geom_line(size=1.2, na.rm=TRUE) +
geom_point(size=2, alpha=0.8) +
facet_grid(~Level, scales="fixed", switch="y") +
theme_minimal() +
theme(
strip.background=element_rect(fill="grey90", color=NA),
panel.spacing=unit(1, "lines")
) +
labs(
title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
y="Rate (%)",
color="Variable"
) +
scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p_line_per_school_horizontal
ggsave("line_per_school_horizontal.png", p_line_per_school_horizontal)
## Saving 7 x 5 in image
#optional interactive plot i can make this so it can be embedded into an html or saved or somethibng if we want that
p_line_interactive1 <- ggplotly(p_line_per_school_horizontal)
p_line_interactive1
p_line_per_school_vertical <- ggplot(data_long, aes(x=Year, y=Value, color=Type)) +
geom_line(size=1.2, na.rm=TRUE) +
geom_point(size=2, alpha=0.8) +
facet_grid(Level~., scales="fixed", switch="y") +
theme_minimal() +
theme(
strip.background=element_rect(fill="grey90", color=NA),
panel.spacing=unit(1, "lines")
) +
labs(
title="Attendance vs. Enrollment Trends in Schools Over Time (USA)",
y="Rate (%)",
color="Variable"
) +
scale_color_manual(values=c("Attendance"="#FF69B4", "Enrollment"="#3bbf8f")
)
p_line_per_school_vertical
ggsave("line_per_school_vertical.png", p_line_per_school_vertical)
## Saving 7 x 5 in image
#optional interactive plot i can make this so it can be embedded into an html or saved or somethibng if we want that
p_line_interactive2 <- ggplotly(p_line_per_school_vertical)
p_line_interactive2
#supplemental plots about the data itself
p1 <- ggplot(data_long, aes(x=Type)) +
geom_bar(fill="#c562f0") +
theme_minimal() +
labs(title="Count of Attendance vs Enrollment Observations",
x="Variable",
y="Count")
p1
ggsave("Observation_counts_variable_attend.png", p1)
## Saving 7 x 5 in image
p2<-ggplot(data_long, aes(x=factor(Year))) +
geom_bar(fill="#c562f0") +
theme_minimal() +
labs(title="Number of Observations Per Year (All Variables)",
x="Year",
y="Count")
p2
ggsave("Observation_counts_year_attend.png", p2)
## Saving 7 x 5 in image
p3<-ggplot(data_long, aes(x=Level, fill=Type)) +
geom_bar(position="dodge") +
theme_minimal() +
labs(title="Number of Observations Per School Level",
x="School Level",
y="Count",
fill="Type")
p3
ggsave("Observation_counts_school_attend.png", p3)
## Saving 7 x 5 in image
#tendency measures
summary_stats <- data_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
mean_rate = mean(Value, na.rm=TRUE),
median_rate = median(Value, na.rm=TRUE),
sd_rate = sd(Value, na.rm=TRUE),
var_rate = var(Value, na.rm=TRUE),
min_rate = min(Value, na.rm=TRUE),
max_rate = max(Value, na.rm=TRUE),
range_rate = max_rate - min_rate,
Q1 = quantile(Value, 0.25, na.rm=TRUE),
Q3 = quantile(Value, 0.75, na.rm=TRUE),
IQR = Q3 - Q1,
) %>%
arrange(Type, Level)
## `summarise()` has grouped output by 'Type'. You can override using the
## `.groups` argument.
summary_stats
## # A tibble: 6 × 13
## # Groups: Type [2]
## Type Level n mean_rate median_rate sd_rate var_rate min_rate max_rate
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Attendan… High… 6 96.4 96.6 0.583 0.340 95.3 96.8
## 2 Attendan… Midd… 6 97.7 98.1 0.966 0.934 95.8 98.3
## 3 Attendan… Prim… 6 97.2 97.8 1.32 1.73 94.6 98.0
## 4 Enrollme… High… 11 95.0 95.2 2.19 4.79 91.4 98.8
## 5 Enrollme… Midd… 11 99.2 99.2 0.630 0.397 97.8 99.8
## 6 Enrollme… Prim… 11 98.2 98.2 1.38 1.91 95.9 99.6
## # ℹ 4 more variables: range_rate <dbl>, Q1 <dbl>, Q3 <dbl>, IQR <dbl>
write_csv(summary_stats, "tendency_attend.csv")
#calculate and visualize
#load stuff
library(tidyverse)
library(plotly)
data_teacher <- read_csv("EdStats_teacher.csv")
## Rows: 23 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country name, Country code, Indicator name
## dbl (29): 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1984, 1985, 1986, 1987, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#select data with most observations and remove data not using yet
data_teacher <- data_teacher %>%
filter(!grepl("non-teaching staff", `Indicator name`, ignore.case = TRUE)) %>%
filter(!grepl("upper-secondary", `Indicator name`, ignore.case = TRUE)) %>% #no teacher count
select(!c(`1975`, `1986`, `1987`, `1990`, `1991`, `1993`, `1995`, `1996`, `1971`, `1972`, `1973`, `1974`, `1976`, `1977`, `1984`, `1985`, `1998`, `1992`, `1994`, `1998`, `2022`, `1988`))
#Set up data for visualization
data_teachers_long <- data_teacher %>%
pivot_longer(cols=`2014`:`2021`, names_to="Year", values_to="Value") %>%
mutate(
Year = as.numeric(Year),
Type = case_when(
grepl("number", `Indicator name`, ignore.case=TRUE) ~ "Number of Teachers",
grepl("compensation", `Indicator name`, ignore.case=TRUE) ~ "Compensation % of Total Expenditure",
TRUE ~ "Other"
),
Level = case_when(
grepl("pre-primary", `Indicator name`, ignore.case=TRUE) ~ "Pre-School",
grepl("primary", `Indicator name`, ignore.case=TRUE) ~ "Elementary School",
grepl("lower secondary", `Indicator name`, ignore.case=TRUE) ~ "Middle School",
grepl("upper secondary", `Indicator name`, ignore.case=TRUE) ~ "High School",
grepl("secondary", `Indicator name`, ignore.case=TRUE) ~ "Undergraduate School",
grepl("tertiary", `Indicator name`, ignore.case=TRUE) ~ "Post-Grad Schooling",
TRUE ~ "Other"
)
) %>%
filter(Type != "Other", Level != "Other", !is.na(Value))
data_teachers_long <- data_teachers_long %>%
mutate(Value_10k = ifelse(Type == "Number of Teachers", Value / 10000, Value))
data_teachers_long <- data_teachers_long %>%
mutate(Level = factor(Level, levels = c(
"Pre-School",
"Elementary School",
"Middle School",
"High School",
"Undergraduate School",
"Post-Grad Schooling"
)))
#comparison plots
#i think ill make seperate graphs cuz these r too squished
p_line_teacher <- ggplot(data_teachers_long, aes(x=Year, y=Value_10k, color=Type)) +
geom_line(size=1.2, na.rm=TRUE) +
geom_point(size=2, alpha=0.8) +
facet_wrap(~Level, nrow=3, scales="fixed") +
theme_minimal() +
theme(
strip.background=element_rect(fill="grey90", color=NA)
) +
labs(title="Teachers & Compensation Trends Over Time (USA)",
y="Variable",
color="Variable") +
scale_color_manual(
values = c("Number of Teachers" = "#FF69B4",
"Compensation % of Total Expenditure" = "#3bbf8f"),
labels = c("Number of Teachers\nIn 10,000s", "Teacher Compensation % of\nTotal Instituation Expenditure")
) +
scale_x_continuous(breaks = seq(2014, 2021, by = 2))
p_line_teacher
ggsave("line_teacherl.png", p_line_teacher)
## Saving 7 x 5 in image
# optional interactive plot
p_teacher_interactive <- ggplotly(p_line_teacher)
p_teacher_interactive
#supplemental plots
# count per Type
p1 <-ggplot(data_teachers_long, aes(x=Type)) +
geom_bar(fill="#c562f0") +
theme_minimal() +
labs(title="Count of Observations per Type", x="Variable", y="Count")
p1
ggsave("Observation_counts_variable_teacher.png", p1)
## Saving 7 x 5 in image
# count per Year
p2 <- ggplot(data_teachers_long, aes(x=factor(Year))) +
geom_bar(fill="#c562f0") +
theme_minimal() +
labs(title="Number of Observations per Year", x="Year", y="Count")
p2
ggsave("Observation_counts_year_teacher.png", p2)
## Saving 7 x 5 in image
# count per Level
p3<-ggplot(data_teachers_long, aes(x=Level, fill=Type)) +
geom_bar(position="dodge") +
theme_minimal() +
labs(title="Number of Observations per School Level", x="Level", y="Count", fill="Variable")
p3
ggsave("Observation_counts_school_teacher.png", p3)
## Saving 7 x 5 in image
#tendency
summary_stats_teachers <- data_teachers_long %>%
group_by(Type, Level) %>%
summarise(
n = n(),
mean_value = mean(Value, na.rm=TRUE),
median_value = median(Value, na.rm=TRUE),
sd_value = sd(Value, na.rm=TRUE),
var_value = var(Value, na.rm=TRUE),
min_value = min(Value, na.rm=TRUE),
max_value = max(Value, na.rm=TRUE),
range_value = max_value - min_value,
Q1 = quantile(Value, 0.25, na.rm=TRUE),
Q3 = quantile(Value, 0.75, na.rm=TRUE),
IQR = Q3 - Q1
) %>%
arrange(Type, Level)
## `summarise()` has grouped output by 'Type'. You can override using the
## `.groups` argument.
summary_stats_teachers
## # A tibble: 12 × 13
## # Groups: Type [2]
## Type Level n mean_value median_value sd_value var_value min_value
## <chr> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Compensatio… Pre-… 6 49.0 49.0 6.83e-1 4.67e-1 48.1
## 2 Compensatio… Elem… 6 49.0 49.0 6.83e-1 4.67e-1 48.1
## 3 Compensatio… Midd… 6 49.0 49.0 6.83e-1 4.67e-1 48.1
## 4 Compensatio… High… 6 49.0 49.0 6.83e-1 4.67e-1 48.1
## 5 Compensatio… Unde… 12 46.8 47.9 2.55e+0 6.52e+0 42.8
## 6 Compensatio… Post… 6 28.5 28.3 7.90e-1 6.25e-1 27.6
## 7 Number of T… Pre-… 8 611256. 625291. 5.07e+4 2.57e+9 491844.
## 8 Number of T… Elem… 8 1729445. 1733677. 3.43e+4 1.18e+9 1687937
## 9 Number of T… Midd… 8 869296. 869613 1.56e+4 2.43e+8 844218
## 10 Number of T… High… 8 826525. 830402 1.87e+4 3.49e+8 794388.
## 11 Number of T… Unde… 8 1695822. 1700015 3.35e+4 1.12e+9 1638606.
## 12 Number of T… Post… 3 1581096 1580932 2.84e+2 8.07e+4 1580932
## # ℹ 5 more variables: max_value <dbl>, range_value <dbl>, Q1 <dbl>, Q3 <dbl>,
## # IQR <dbl>
write_csv(summary_stats_teachers, "tendency_teachers.csv")